Casting doubt upon scientific findings


PLoS Medicine, 2005
Some links to its media coverage

This I believe in genetics: discovery can be a nuisance, replication is science, implementation matters
John P. A. Ioannidis, 2013, Frontiers in Genetics


Reproducibility in recent scientific publications

Science, December 2011
Barbara R. Jasny, Gilbert Chin, Lisa Chong, Sacha Vignieri


Science: again, and again, and again…

How do we promote the publication of replicable data?

Scientific gold standard

Replication: the confirmation of results and conclusions from one study obtained independently in another.

Difficulties

  • New tools and technologies
  • Massive amounts of data
  • Long-term studies
  • Interdisciplinary approaches
  • Complex questions

Attempts to replicate reveal scientific uncertainties


Nature announces revised standards…

Special collection, April 2013


…and some picks from the Nature collection


…more picks from Nature

Reproducibility: Six red flags for suspect work
C. Glenn Begley explains how to recognize the preclinical papers in which the data won’t stand up. Nature (23 May 2013)

Reproducibility: The risks of the replication drive
The push to replicate findings could shelve promising research and unfairly damage the reputations of careful, meticulous scientists, says Mina Bissell. Nature (20 November 2013)


The Economist kicks in…

Problems with scientific research

How science goes wrong
Scientific research has changed the world. Now it needs to change itself
Trouble at the lab
Scientists like to think of science as self-correcting. To an alarming degree, it is not
Oct 19th 2013


…and the New York Times

New Truths That Only One Can See. George Johnson, JAN. 20, 2014

The Duke saga: Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology
Keith A. Baggerly and Kevin R. Coombes, Ann. Appl. Stat., Volume 3, Number 4 (2009), 1309-1334
See also: a starter set, video lecture, 60 Minutes


The statistics take…

Reproducible research and Biostatistics
Editorial 2009, Roger Peng

An estimate of the science-wise false discovery rate and application to the top medical literature
Leah R. Jager and Jeffrey T. Leek, Biostatistics (2014)
with mixed reactions in the accompanying discussions by leading statisticians (including Yoav Benjamini, David R. Cox, Andrew Gelman, John P. A. Ioannidis)


Some tools for reproducible research




DOs and DON’Ts (by Roger Peng)


  • Yes! Teach a computer
  • Yes! Version control
  • Yes! Keep track of the software environment (computer architecture, OS, software and packages, …)
  • Yes! Set your seed (but with caution!)
  • Yes! Think of the entire pipeline (raw data → processed data → analysis → report)
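The "set your seed" point is easy to illustrate with a minimal R sketch: fixing the seed makes stochastic results exactly reproducible (the caution being that the same seed should not be silently reused across unrelated analyses).

```r
# Fixing the RNG seed makes simulation results exactly reproducible
set.seed(42)
x <- rnorm(5)

set.seed(42)          # same seed again ...
y <- rnorm(5)         # ... gives exactly the same draws

identical(x, y)       # TRUE
```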


Writing a reproducible report for getting in the news

So let’s look at how we can get a dynamic and reproducible report with knitr, for getting in the news.

The knitr process of an evolving document

knittable document (Markup + code chunks) \(\longrightarrow\) Markup only \(\longrightarrow\) Presentation

Markdown example

gettingTheNews.Rmd \(\longrightarrow\) gettingTheNews.md \(\longrightarrow\) gettingTheNews.html

\({\rm \LaTeX}\) example

gettingTheNews.Rnw \(\longrightarrow\) gettingTheNews.tex \(\longrightarrow\) gettingTheNews.pdf

Only ever edit the .Rmd or .Rnw files (intermediate changes will be lost when reprocessing them)

!!! The code needs to work for the document to be produced!


Markdown essentials

Headings

## I am a big header... 
### and I am a smaller one

I am a big header…

and I am a smaller one

Italics and bold

*We are italic* and _We are italic_
**We are bold** and __We are bold__

We are italic and We are italic
We are bold and We are bold

End a line with two trailing spaces to force a line break
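A minimal illustration (the trailing spaces are invisible on screen, shown here as ␣):

```markdown
Roses are red␣␣
Violets are blue
```

This renders as two lines rather than a single running paragraph.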


Markdown essentials

Unordered lists

Unordered List
* Things not to forget
* in no particular order
  * not me
  * and me
  • Things not to forget
  • in no particular order
      • not me
      • and me


Markdown essentials

Ordered List

1. I am the most important
2. Maybe I will be top soon
3. Running for the second
   * Won't be last
   * Anybody down there?
  1. I am the most important
  2. Maybe I will be top soon
  3. Running for the second
      • Won’t be last
      • Anybody down there?


Markdown essentials

Code chunks

```{r mdLabel}
rnorm(10)
```
##  [1] -0.22602  0.61757 -0.77988  1.20188  0.75303  0.04728  0.86098
##  [8] -0.59159 -0.35614  0.16592

Markdown essentials

\({\rm \LaTeX}\) equations and plots

$\frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{(x-\mu)^2}{2\sigma^2})$ \(\frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{(x-\mu)^2}{2\sigma^2})\)

library(ggplot2)                               # needed for ggplot and friends
joran <- rgb(.8,.4,0); jpurp <- rgb(.5,0,1)    # custom colours used below
nS <- 1000; y <- data.frame(index = (1:nS), normS=rnorm(nS))
ggplot(y, aes(x=normS)) + geom_histogram(aes(y=..density..), binwidth=.2, 
  colour="darkblue", fill="white") + geom_density(alpha=.1, colour=joran, fill="#FFFF66",
    size=.8) + labs(y="", x="", title="Kernel density estimation") + 
      stat_function(fun = dnorm, colour = jpurp, size=.8)

plot of chunk Norm


Markdown essentials

Syntax for including local images

Markdown: ![title](path/to/image), HTML: <img src="path/to/image" />

Writing a cool story

<img src="figure/snoobybanner.jpg" height="250px" />

Essential, but sufficient in most situations… and pretty amazing for drafting!


The Markdown news document

Make a new .Rmd file, here gettingTheNews.Rmd, the beginning of which looks like this:

**Giusi Moffa**  
**Statistical bioinformatics, University of Regensburg**  
**Practical bioinformatics, 27 May 2014**  

p-values, one big topic for reproducible findings
========================================================

So let's look at how we can get a dynamic and reproducible 
report with `knitr`, for getting in the news.


### Recording the $R$ session
This may be useful, since different versions will in general not produce identical results.

    ```{r}
    # Print R session info
    sessionInfo()
    ```
    
It is a good habit also to set your working directory and check whether you are 
in the right place.

    ```{r}
    setwd("~/juicy/BioStatWork2012/mytex/journalClub/Reproducibility/repBioInfo")
    getwd() ## obtain the working directory
    ``` 
    
![Jelly Beans](figure/significant.png)    

## One old trick for getting in the news

Under the null-hypothesis p-values are expected to be uniformly distributed 
between 0 and 1. So if we have hundreds of crazy hypotheses we can try 
a fishing expedition for significance by testing all of them. 
On average 5% will come up as significant.

    ```{r, fig.height=3}
    set.seed(31)
    nColors <- 200
    nObs <- 100
    jellyBeans <- matrix(rnorm(nObs*nColors), ncol=nColors)
    fishing4News <- apply(jellyBeans, 2, function(x) t.test(x)$p.value)
    hist(fishing4News, col=4, freq=FALSE, main="Distribution of p-values", 
        xlab="p-values")
    lines(density(fishing4News), col=2, lwd=2)    
    ```

Create HTML and R files from Rmd

Compile an HTML file with knit2html("gettingTheNews.Rmd")

Equivalent to

knit("gettingTheNews.Rmd")
markdownToHTML("gettingTheNews.md")

The output will be gettingTheNews.html

You can also produce a PDF from your Markdown file with pandoc

Extract R code as before with purl

purl("gettingTheNews.Rmd")

The output file will be gettingTheNews.R


Back to the news via Markdown


### Recording the \(R\) session

This may be useful, since different versions will in general not produce identical results.


To be continued…


# Print R session info
sessionInfo()
## R version 3.0.3 (2014-03-06)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_1.0.0
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-4 digest_0.6.4     evaluate_0.5.5   formatR_0.10    
##  [5] grid_3.0.3       gtable_0.1.2     htmltools_0.2.4  knitr_1.6       
##  [9] labeling_0.2     MASS_7.3-33      munsell_0.4.2    plyr_1.8.1      
## [13] proto_0.3-10     Rcpp_0.11.2      reshape2_1.4     rmarkdown_0.2.49
## [17] scales_0.2.4     stringr_0.6.2    tools_3.0.3      yaml_2.1.13

The news document

It is a good habit also to set your working directory and check whether you are in the right place.

setwd("~/juicy/BioStatWork2012/mytex/journalClub/Reproducibility/repBioInfo")
getwd() ## obtain the working directory
## [1] "/Users/giusimoffa/juicy/BioStatWork2012/mytex/journalClub/Reproducibility/repBioInfo"

p-values, one big topic for reproducible findings

Under the null-hypothesis p-values are expected to be uniformly distributed between 0 and 1. So if we have hundreds of crazy hypotheses we can try a fishing expedition for significance by testing all of them. On average 5% will come up as significant.

set.seed(31); nColors <- 200; nObs <- 100
jellyBeans <- matrix(rnorm(nObs*nColors), ncol=nColors)
fishing4News <- apply(jellyBeans, 2, function(x) t.test(x)$p.value)
hist(fishing4News, col=4, freq=FALSE, main="Distribution of p-values", xlab="p-values")
lines(density(fishing4News), col=2, lwd=2)

plot of chunk unnamed-chunk-3


This way we get 7 significant results. This problem is well known in genomics and is addressed by a plethora of multiple-testing correction methods. Transparency, however, is not always guaranteed, especially in fields such as social science, and the issue remains a hotly debated topic.
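As a check, the count of nominally significant results can be computed directly; a sketch re-running the simulation chunk from above (same seed, 200 tests of true null hypotheses):

```r
# Re-run the jelly-bean simulation: 200 tests, all nulls are true
set.seed(31)
nColors <- 200
nObs <- 100
jellyBeans <- matrix(rnorm(nObs*nColors), ncol=nColors)
fishing4News <- apply(jellyBeans, 2, function(x) t.test(x)$p.value)

sum(fishing4News < 0.05)   # number of nominally significant tests
nColors * 0.05             # expected count under the null: 10
```

The observed 7 is well within binomial sampling variability around the expected 10.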


Correction for multiple testing

One way to account for multiple testing is to adopt the Benjamini-Hochberg correction. However, if we are “lucky” we can still get in the news even after correcting for multiple testing.

set.seed(7)
nColors <- 10000
nObs <- 20
jellyBeans <- matrix(rnorm(nObs*nColors), ncol=nColors)
fishing4News <- apply(jellyBeans, 2, function(x) t.test(x)$p.value)
BHfishing <- p.adjust(fishing4News, "BH")
min(BHfishing)
## [1] 0.01483
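p.adjust supports several methods; as a comparison, a sketch of the more conservative Bonferroni correction on the same simulated p-values (Bonferroni-adjusted values can never fall below the BH-adjusted ones):

```r
# Same fishing expedition as above
set.seed(7)
nColors <- 10000
nObs <- 20
jellyBeans <- matrix(rnorm(nObs*nColors), ncol=nColors)
fishing4News <- apply(jellyBeans, 2, function(x) t.test(x)$p.value)

# Bonferroni multiplies each p-value by the number of tests (capped at 1)
bonferroni <- p.adjust(fishing4News, "bonferroni")
min(bonferroni)
# The BH-adjusted minimum can only be smaller or equal
min(bonferroni) >= min(p.adjust(fishing4News, "BH"))   # TRUE
```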

And if you really feel you need to, you can also save (and load) your entire workspace.

save.image(file="fishing.RData")
load(file="fishing.RData")

Disclaimer

This report is freely available for the benefit of science, so that our steps on the way to the news can be checked by anybody who wishes to do so.


Reproducing this presentation

Installing and loading packages

Install knitr (the markdown package should be installed by default) as usual, and slidify from GitHub using the devtools package.

install.packages("knitr")
install.packages('devtools')
require(devtools)
pkgs <- c("slidify", "slidifyLibraries", "rCharts")
install_github(pkgs, 'ramnathv', ref = 'dev')

Load packages and compile slides

library('knitr'); library('slidify')
setwd("~/path/to/repBioInfo") # edit ~/path/to/repBioInfo/index.Rmd
slidify("index.Rmd")

Making your own slidified presentation

Knitting and slidifying

  1. Write using R Markdown
  2. Use an empty line followed by three dashes to separate slides!
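The two steps above translate into a minimal index.Rmd skeleton (slide titles here are placeholders):

```
## First slide

Some content, possibly with code chunks.

---

## Second slide
```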

Making your own deck

slidify("index.Rmd")
browseURL("index.html")
author("mydeck")
# edit ~/path/to/mydeck/index.Rmd

It can easily be shared with the publish command to GitHub, Dropbox or RPubs

publish('myDeck', host = "dropbox")

More sophisticated documents

\({\rm \LaTeX}\) + knitr \(\longrightarrow\) Rnw files, here gettingTheNews.Rnw

The equivalent of the markdown file will look as follows

\documentclass{article}
\usepackage{hyperref}
\hypersetup{colorlinks,% 
                 urlcolor=blue}
\author{Giusi Moffa \\[2ex] Statistical bioinformatics, University of Regensburg\\[4ex]}
\title{A \texttt{knitr} + \LaTeX~ dynamic report.\\[4ex]}
\date{27 May 2014 \\[4ex] Practical bioinformatics}

\begin{document}
\maketitle
\section*{p-values, one big topic for reproducible findings}
So let's look at how we can get a dynamic and reproducible report with \texttt{knitr},
for getting in the news.

\subsubsection*{Recording the $R$ session.}
This may be useful, since different versions will in general not produce identical results.

<<rInfo, results='asis'>>=
# Print R session info
toLatex(sessionInfo())
@

It is a good habit also to set your working directory and check whether you are 
in the right place.

<<rWD>>=
setwd("~/juicy/BioStatWork2012/mytex/journalClub/Reproducibility/repBioInfo")
getwd() ## obtain the working directory
@

\subsubsection*{One old trick for getting in the news}
\includegraphics[width=.6\textwidth]{figure/significant.png}
%\clearpage

Under the null-hypothesis p-values are expected to be uniformly distributed 
between 0 and 1. So if we have hundreds of crazy hypotheses we can try a 
fishing expedition for significance by testing all of them. 
On average 5\% will come up as significant.

<<rSimul, fig.height=3>>=
set.seed(31)
nColors <- 200
nObs <- 100
jellyBeans <- matrix(rnorm(nObs*nColors), ncol=nColors)
fishing4News <- apply(jellyBeans, 2, function(x) t.test(x)$p.value)
hist(fishing4News, col=4, freq=FALSE, main="Distribution of p-values", 
    xlab="p-values")
lines(density(fishing4News), col=2, lwd=2)
@

Create pdf and R files from Rnw

Compile a PDF file with knit2pdf("gettingTheNews.Rnw")

Equivalent to

knit("gettingTheNews.Rnw")
library(tools)
texi2pdf("gettingTheNews.tex")

The output will be gettingTheNews.pdf

Extract R code with purl

purl("gettingTheNews.Rnw")

The output file will be gettingTheNews.R


More knitr basics

\({\rm \LaTeX}\) code chunks

Chunks start with <<label, options>>= and end with @

<<rLabel, eval=FALSE>>=
  y <- rnorm(1000)
  library(ggplot2)
  joran <- rgb(.8,.4,0)
  jpurp <- rgb(.5,0,1)
  hist(y, col=4)
@
  • Inline code is inserted with \Sexpr{}
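For example, in the body of the Rnw file one could write (a hypothetical snippet; meanY stands for any object computed in an earlier chunk):

```
The sample mean was \Sexpr{round(meanY, 2)}, based on \Sexpr{length(y)} draws.
```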

knitr quick reference


A selection of chunk options


References and credits

knitr

slidify

Statistical Analyses and Reproducible Research (Gentleman and Lang, 2004)


Further (blog) readings

Reproducible Research

Evidence-based Data Analysis: Treading a New Path for Reproducible Research:
Part 1, Part 2, Part 3

Reproducible research: Notes from the field

Replication and validation in -omics studies - just as important as reproducibility

Tools for Reproducible Research, by Karl Broman

Resources and further reading
Top Ten Reasons To Not Share Your Code, and why you should anyway. Randall J. LeVeque (SIAM News, April 1, 2013)

Publish your computer code: it is good enough. Nick Barnes (Nature, 13 October 2010)


From science to anecdotes… and vice versa